EmpiriST: AIPHES - Robust Tokenization and POS-Tagging for Different Genres

نویسندگان

  • Steffen Remus
  • Gerold Hintz
  • Christian Biemann
  • Christian M. Meyer
  • Darina Benikova
  • Judith Eckle-Kohler
  • Margot Mieskes
  • Thomas Arnold
چکیده

We present our system used for the AIPHES team submission in the context of the EmpiriST shared task on “Automatic Linguistic Annotation of ComputerMediated Communication / Social Media”. Our system is based on a rulebased tokenizer and a machine learning sequence labelling POS tagger using a variety of features. We show that the system is robust across the two tested genres: German computer mediated communication (CMC) and general German web data (WEB). We achieve the second rank in three of four scenarios. Also, the presented systems are freely available as open source components.

برای دانلود رایگان متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید

ثبت نام

اگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید

منابع مشابه

LTL-UDE $@$ EmpiriST 2015: Tokenization and PoS Tagging of Social Media Text

We present a detailed description of our submission to the EmpiriST shared task 2015 for tokenization and part-of-speech tagging of German social media text. As relatively little training data is provided, neither tokenization nor PoS tagging can be learned from the data alone. For tokenization, our system uses regular expressions for general cases and word lists for exceptions. For PoS tagging...

متن کامل

bot.zen $@$ EmpiriST 2015 - A minimally-deep learning PoS-tagger (trained for German CMC and Web data)

This article describes the system that participated in the Part-of-speech tagging subtask of the EmpiriST 2015 shared task on automatic linguistic annotation of computer-mediated communication / so-

متن کامل

Part-of-speech tagging of Modern Hebrew text

Words in Semitic texts often consist of a concatenation of word segments, each corresponding to a Part-of-Speech (POS) category. Semitic words may be ambiguous with regard to their segmentation as well as to the POS tags assigned to each segment. When designing POS taggers for Semitic languages, a major architectural decision concerns the choice of the atomic input tokens (terminal symbols). If...

متن کامل

Character-based Joint Segmentation and POS Tagging for Chinese using Bidirectional RNN-CRF

We present a character-based model for joint segmentation and POS tagging for Chinese. The bidirectional RNN-CRF architecture for general sequence tagging is adapted and applied with novel vector representations of Chinese characters that capture rich contextual information and sub-character level features. The proposed model is extensively evaluated and compared with a state-of-the-art tagger ...

متن کامل

The Effect of Automatic Tokenization, Vocalization, Stemming, and {POS} Tagging on {A}rabic Dependency Parsing

We use an automatic pipeline of word tokenization, stemming, POS tagging, and vocalization to perform real-world Arabic dependency parsing. In spite of the high accuracy on the modules, the very few errors in tokenization, which reaches an accuracy of 99.34%, lead to a drop of more than 10% in parsing, indicating that no high quality dependency parsing of Arabic, and possibly other morphologica...

متن کامل

ذخیره در منابع من


  با ذخیره ی این منبع در منابع من، دسترسی به آن را برای استفاده های بعدی آسان تر کنید

عنوان ژورنال:

دوره   شماره 

صفحات  -

تاریخ انتشار 2016